The first high-quality chromosome-level genome of the Sipuncula Sipunculus nudus using HiFi and Hi-C data

Sipuncula is a class of exocoelomic unsegmented animals whose evolutionary relationships are unresolved. The peanut worm Sipunculus nudus is a globally distributed, economically important species belonging to the class Sipuncula. Herein, we present the first high-quality chromosome-level assembly of S. nudus based on HiFi reads and high-resolution chromosome conformation capture (Hi-C) data. The assembled genome was 1,427 Mb, with a contig N50 length of 29.46 Mb and scaffold N50 length of 80.87 Mb. Approximately 97.91% of the genome sequence was anchored to 17 chromosomes. A BUSCO assessment showed that 97.7% of the expectedly conserved genes were present in the genome assembly. The genome was composed of 47.91% repetitive sequences, and 28,749 protein-coding genes were predicted. A phylogenetic tree demonstrated that Sipuncula belongs to Annelida and diverged from the common ancestor of Polychaeta. The high-quality chromosome-level genome of S. nudus will serve as a valuable reference for studies of the genetic diversity and evolution of Lophotrochozoa.


Background & Summary
Sipuncula (peanut worms) are unsegmented coelomate worms with bilaterally symmetrical bodies that are separated into a trunk and are retractable introverts 1 . Belonging to Lophotrochozoa, they are believed to form a small phylum with approximately 150 described species 2 . However, they are widely distributed in the world's oceans at all depths, occupying most marine habitats, from intertidal zones to abyssal depths and polar to equatorial seas, including extreme environments. Over the past 520 million years, the typical features of extant Sipuncula have undergone only minor changes 3 . Therefore, Sipuncula is an exciting resource to study environmental adaptation and evolution and as an indicator of global climate change. In coastal environments, these species are critical in bioturbation to reshape the physicochemical properties and biological characteristics of the sediment 4 . In marine wetlands and pond aquaculture systems, Sipuncula and other taxa increase organic matter transport and improve ecosystem services 5 . However, gene and genome data for Sipuncula that are available in the PDB, a public database, are insufficient.
Despite the early recognition of the group, phylogenetic relationships between Sipuncula and other taxa are unclear. Sipunculus nudus was first described by Linnaeus in 1767 and was later considered to be a derived group of annelids [6][7][8] . Morphological and developmental characteristics suggest that Sipuncula is the sister group of Mollusca 9 . However, phylogenetic analyses based on mitochondrial DNA sequences as well as traits related to nervous and muscle system development indicate that Sipuncula is more closely related to Annelida than to Mollusca 10,11 . Torsten et al. performed phylogenomic analyses using 47,953 amino acid positions to explore the relationships among 34 annelid taxa and found that Sipuncula belongs to Annelida 12 . Therefore, the assignment of Sipuncula to annelids is still a controversial issue. Furthermore, the lack of segments in Sipuncula, which is www.nature.com/scientificdata www.nature.com/scientificdata/ Genome survey and assembly. The size, heterozygosity, and repeat rate of the S. nudus genome were estimated using the k-mer frequency method. Jellyfish 15 and GenomeScope v.1.0 16 were employed to calculate the K-mer frequency (k = 21), which was based on HiFi reads, and the genome size was estimated to be 1305 Mb with a peak K-mer frequency of 66X. The heterozygosity and repeat rate were 2.03% and 39.68%, respectively (Fig. 2). We first assembled the genome using HiFi reads via HiFi-asm (v0.15.1) 17 with default parameters. After preliminary assembly, we used purge_haplotigs 18 to purge haplotigs. The haploid genome size was 1426.68 Mb, and the N50 length was 29.46 Mb ( Fig. 3 and Table 3).
The contigs were anchored to chromosomes using Hi-C data. Juicer (version 1.6) 19 was used to align the double-ended sequencing data against the assembled genome to complete the evaluation of the Hi-C library. The 3D-DNA pipeline 20 under default parameters without breaking contigs was chosen to generate the final chromosome-level scaffolds. Manual checking and refinement of the draft assembly were carried out via Juicebox Assembly Tools (https://github.com/aidenlab/Juicebox, v1.1108). A heatmap of the Hi-C assembly interaction bins indicated that the quality of the genome assembly was excellent (Fig. 4). The length of the final assembled genome was 1,426,776,655 bp, with a contig N50 of 29,460,569 bp and scaffold N50 of 80,869,746 bp (Table 3 and Fig. 3). Approximately 1,397 Mb (97.91%) of the contig sequences were anchored to 17 chromosomes (Table 4), which is consistent with the known karyotype in our previously published manuscript 21 . Using the minimap2 (v2.17, parameters: -a -x map-pb) 22 alignment results and the HiFi data, we used BamDeal (https://github.com/BGI-shenzhen/BamDeal) to evaluate the mapping rate and coverage and obtained estimates of 99.95% and 99.73%, respectively. The CIRCOS tool 23 was used to visualize the 17 chromosomes, GC content, read depth and mapping depth (Fig. 5). The average depth of each chromosome was calculated and is shown in Fig. 6. Seventeen chromosomes had a comparable sequencing depth, and there was no whole chromosome with half the read depth. Therefore, XY-or ZW-type sex chromosomes did not exist in the assembled chromosomes of S. nudus. Based on 20-kb nonoverlapping sliding windows in the chromosomes to calculate the GC content and read average depth, there was a small cluster of sliding windows (a total of 11.6 Mb with 581 sequences) that exhibited relatively high GC contents ( > 48%) but with a normal sequencing depth (Fig. 7). By extracting those block sequences with high GC contents and mapping them to the NT database (Nucleotide Sequence Database) using MegaBlast (parameter: −e 1e-5), the alignments with identity >90% and coverage length >100 bp were filtered. The matched reference species in the alignments from the NT database were grouped into three categories: the S. nudus species, the species of other invertebrates, and all other species except the two mentioned above. All the matched sequences (228) could be correlated with S. nudus or other invertebrate species (Fig. 8), which demonstrated that the sequence blocks with high GC content and normal depth in chromosomes were from the S. nudus species rather than from contamination or cobionts.
www.nature.com/scientificdata www.nature.com/scientificdata/ Repeat annotation. Prior to gene prediction, we identified the repetitive elements in the genome of S. nudus by using a combination of homology-based and ab initio-based methods. To identify tandem repeats, we used Tandem Repeats Finder v4.09 24 . For the homology-based method, transposable elements were identified by RepeatMasker v4.0.7 (-nolow -no_is -norna -engine ncbi -parallel 1) and RepeatProteinMask v4.0.7 (-engine ncbi -noLowSimple -pvalue 0.0001) 25 against the TE protein databases and RepBase library v21.12 26 . For the ab initio-based method, LTR_FINDER v1.06 27 and RepeatModeler v1.0.8 (http://repeatmasker.org/RepeatModeler/) with default parameters were used to build the de novo library before RepeatMasker v4.0.7 was used to classify the different categories of repetitive elements against this library. The final repetitive elements were identified by integrating the results of these methods according to sequence overlap, revealing that nearly half of the genome consists of repetitive elements (Tables 5, 6; Fig. 5).
Gene prediction. Gene annotation was performed by integrating homology-, de novo-and transcriptome-based information. We used the annotation data from three closely related species (Caenorhabditis elegans, Capitella teleta, and Helobdella robusta) for homology prediction. The MAKER tool 28 was used to integrate the annotation data from the three related species and the transcriptome data from S. nudus. Based on AED

Fig. 2
Overview of the 21-mer frequency distribution in the S. nudus genome. The X-axis is the k-mer depth, and the Y-axis represents the k-mer frequency for a given depth.  www.nature.com/scientificdata www.nature.com/scientificdata/ values from MAKER, 2000 genes with complete structures were selected and used to train the de novo prediction tools Augustus 29 and Snap 30 to construct de novo models. Finally, all data were integrated using MAKER 28 . The final comprehensive gene set contained 28,749 genes (Table 7).

Data Records
The National Center for Biotechnology Information (NCBI) BioProject accession number for the sequence reported in this paper is PRJNA901211. The raw data for Hi-Fi and Hi-C sequencing were submitted to NCBI SRA (accession number SRP408321; https://identifiers.org/ncbi/insdc.sra:SRP408321) 42 and deposited in the CNGB Sequence Archive (CNSA) of the China National GeneBank DataBase (CNGBdb) (accession number CNR0640303-CNR0640323; https://db.cngb.org/search/project/CNP0003624/) 43 . The assembled genome sequence was deposited into NCBI under accession number JAPPUL000000000 44 . The assembled genome, gene structure annotation, repeat predictions, gene function annotation, KEGG analysis of expanded genes and positively selected gene data were deposited in the China National GeneBank DataBase (CNGBdb) under the project with accession number CNP0003624.

Technical Validation
Genome assembly and gene prediction quality assessment. The BUSCO pipeline was used to evaluate the completeness of the genome assembly and gene set based on a benchmark of 255 conserved genes in eukaryota_odb10 (creation date: 2020-09-10, number of genomes: 70, number of BUSCOs: 255). In total, 97.7% of the 255 expected conserved genes were identified as complete, and 2% were identified as fragmented. Furthermore, we used minimap2 (v2.17, parameters: -a -x map-pb) 22 to align the assembly with the HiFi data, and the mapping rate and coverage rate were estimated to be 99.95% and 99.73%, respectively. The BUSCO (v5) 45 results supported the completeness of the assembly; 97.7% of 255 conserved genes were identified as complete by using eukaryota_odb10 ( Table 9). The BUSCO results and alignment results indicated high genome assembly completeness and correctness.  Table 10. Among the orthologous genes in the 16 species, a total of 255 single-copy genes were identified. The single-copy orthologues were aligned using MUSCLE (v3.7) 47 with default parameters, and then the aligned protein sequences were reverse translated into codon sequences. The alignments were then concatenated to generate a superalignment matrix for phylogenetic reconstruction based on the maximum-likelihood (ML) method using IQ-TREE (v1.6.12) 48 , with the best-fit evolutionary substitution model being determined using ModelFinder 49 . Divergence times for each node in the phylogenetic tree were estimated using MCMCtree, which is implemented in PAML package v4.8a 50 , under the following parameters: -nsample 100000, -rootage 800, and -burnin 500000. The calibration times were obtained from TimeTree 51 Fig. 9. Gene collinearity, which shows the preservation of ancestral genome structure in the modern genome, is an important means of unveiling genomic evolution. Thus, MCscan (Python version) 52 was used for the genomic synteny analysis between S. nudus, O. fusiformis and P. esculenta. The collinearity figure was drawn based on the homologous blocks with ≥ 4 gene collinear pairs between species by JCVI (https://github.com/tanghaibao/jcvi) (Fig. 10). Regarding

Sipunculus nudus
Other invertebrates others Fig. 8 The categories of reference species in the alignments from the NT database.
The time-calibrated phylogenetic tree was used to assess gene family expansions and contractions using CAFÉ 4.2.1 53 with a random birth-and-death model with lambda. In total, 543 and 97 significantly expanded and contracted gene families were identified, respectively (P < 0.05). GO and KEGG enrichment analyses of      www.nature.com/scientificdata www.nature.com/scientificdata/ the expanded gene families revealed that these families were mainly involved in pathways that are related to apoptosis, detoxification, the immune response, amino acid and fatty acid metabolism anion, oxidative stress, and energy metabolism.
PSGs (positively selected genes) were predicted using branch-site likelihood ratio tests for single-copy gene families with a conservative 10% false discovery rate (FDR) criterion 54 . We used proteins from S. nudus, C. teleta, E. Andrei, L. satsuma, O. fusiformis, and P. esculenta to extract 3,192 one-to-one orthologous genes using the OrthoFinder (v2.3.11) pipeline. The one-to-one orthologous genes were then used to generate multiple sequence alignments by using PRANK (v. 121002) 55 . The d N /d S ratios of the codons were calculated using the branch-site model of Codeml in the PAML package 50 , in which S. nudus was set as the foreground branch and the other five taxa as background branches. Using a likelihood ratio test (LRT) of ≤0.05 and an FDR of ≤0.05 as thresholds, 326 PSGs were identified in the S. nudus genome. These PSGs were significantly enriched in the terms "Spliceosome, " "Base excision repair, " "DNA replication, " and "Cell cycle" in the KEGG pathway enrichment analysis.
In summary, we obtained the high-quality chromosome-level genome of S. nudus, which contributes to our understanding of the evolutionary status of Sipuncula and the evolutionary relationship among the subgroups of the phylum Annelida. Gene family expansion and extraction and genomic synteny analyses revealed the potential adaptation mechanism of Sipuncula to different living environments.

Usage Notes
All analyses were run on Linux systems, and the optimal parameters are given in the Code availability section.

Code availability
No specific code or script was used in this work. Commands used for data processing were all executed according to the manuals and protocols of the corresponding software.